Abstract

In this paper, we propose a different insight to analyze AdaBoost. This
analysis reveals that, beyond some preconceptions, AdaBoost can be directly
used as an asymmetric learning algorithm, preserving all its theoretical
properties. A novel class-conditional description of AdaBoost, which models
the actual asymmetric behavior of the algorithm, is presented.

1. Introduction

Asymmetry is present in many real-world pattern recognition applications.
Medical diagnosis, disaster prediction, biometrics, fraud detection, etc.
have obviously different costs associated with the different kinds of
mistakes (false positives and false negatives) implicitly related to each
decision. But asymmetry is not only connected to the direct cost of a
mistake. Many problems have unbalanced class priors, where one of the
classes is far more frequent than the other, or is easier to sample. This
kind of problem may require classifiers capable of focusing their attention
on the rare (but most valuable) class, instead of trying to find hypotheses
that fit the data well in general (mainly driven by the prevalent class).

Since their original publication, boosting algorithms (Schapire, 1990) and
specifically AdaBoost (Freund and Schapire, 1997) have drawn a lot of
attention from the pattern recognition community. Their strong properties
and theoretical guarantees, their resistance to overfitting and their
promising practical results have focused interest on this family of
algorithms (e.g., Schapire et al., 1998a; Schapire and Singer, 1999;
Friedman et al., 2000; Mease and Wyner, 2008a; Viola and Jones, 2004) both
from the theoretical (different interpretations, modifications,
discussions...) and practical points of view.

In the literature, several modifications of AdaBoost have been proposed to
deal with asymmetry (Karakoulas and Shawe-Taylor, 1998; Fan et al., 1999;
Ting, 2000; Viola and Jones, 2004, 2002; Sun et al., 2007; Masnadi-Shirazi
and Vasconcelos, 2007). Viola and Jones, in their face detector framework
(2004), use a validation set to modify the AdaBoost strong classifier
threshold in order to trade off false positive and detection rates.
Nevertheless, as they stated, it is not clear whether this change preserves
the original training and generalization guarantees of AdaBoost (Viola and
Jones, 2004), and the weak classifier selection is not optimal for an
asymmetric task (Viola and Jones, 2002). Most of the other proposed
algorithms (Karakoulas and Shawe-Taylor, 1998; Fan et al., 1999; Ting, 2000;
Viola and Jones, 2002; Sun et al., 2007) try to reach asymmetry through
direct manipulations of the weight distribution update rule. These are
heuristic modifications of the algorithm, but not a full reformulation of
AdaBoost for asymmetric classification problems. On the other hand, the more
recent Asymmetric Boosting algorithm (Masnadi-Shirazi and Vasconcelos, 2007)
finds a new solution to the problem based on the Statistical View of
Boosting (Friedman et al., 2000). Their result is theoretically solid, but
the final algorithm is far more complex and computationally demanding than
the original AdaBoost.

Even though some studies (Freund and Schapire, 1997; Zadrozny et al., 2003)
mention that the incorporation of unbalanced initial weights could lead to a
cost-sensitive version of AdaBoost, subsequent works insist that this is not
enough to reach effective asymmetry (Viola and Jones, 2002; Mease et al.,
2007; Sun et al., 2007; Masnadi-Shirazi and Vasconcelos, 2007), swelling the
number of asymmetric boosting variants. Meanwhile, standard AdaBoost is
still explained with a uniform initial weight distribution (e.g., Schapire
and Singer, 1999; Friedman et al., 2000; Schapire et al., 1998a; Fan et al.,
1999; Ting, 2000; Sun et al., 2007; Masnadi-Shirazi and Vasconcelos, 2007;
Freund and Schapire, 1999; Polikar, 2006, 2007). To the best of our
knowledge, a formal explanation of the consequences of using asymmetric
initial weights on AdaBoost has not been provided, either in one way (they
lead to effective asymmetry) or the other (they are definitely useless), so
we think that some light must be shed in order to clarify the actual
asymmetric learning capabilities of AdaBoost.

In this paper we propose a new perspective to analyze AdaBoost in a
class-conditional way. This analysis suggests that, with nothing more than
an unbalanced class-conditional initialization of the weight distribution,
AdaBoost is, by itself, a theoretically sound asymmetric classification
algorithm. Based on class error decomposition, our analysis offers a new
model to understand AdaBoost behavior and how it really deals with asymmetry
in an additive round-by-round scheme. In fact, weight initialization is no
more than a way to modify the data distribution seen by the learner and, as
we will see, it can be easily shown to shape the error bound that sets the
AdaBoost minimization goal. One key point of our work is that it is merely
an analysis, so AdaBoost is unchanged. As a consequence, all the theoretical
properties of the algorithm (related to training and generalization errors)
remain intact, which has not been clearly reported for the other
modifications in the literature. Our analysis is inspired by the generalized
derivation of Schapire and Singer (1999), which is close to the original
(Freund and Schapire, 1997) and especially intuitive and illustrative for
our purpose. The Statistical View of Boosting (Friedman et al., 2000) and
all its subsequent controversy (Mease and Wyner, 2008a; Bennett et al.,
2008; Mease and Wyner, 2008b) is left aside, although an analogous
conclusion could be derived from it.

The paper is organized as follows: in the next section we describe the
original AdaBoost algorithm and its relationship with asymmetry. Section 3
details our novel class-dependent interpretation, its analysis and some
experimental results which show the actual asymmetric behavior of AdaBoost.
Finally, Section 4 includes the main conclusions drawn from this analysis.

2. AdaBoost 

In this section we will analyze the original AdaBoost definition and how
it has usually been adapted to asymmetric learning.

2.1. Algorithm 

Given a set of $n$ training examples $(x_i, y_i)$, of which the first $m$
are positives $\{y_i = 1\}_{i=1}^{m}$ and the rest are negatives
$\{y_i = -1\}_{i=m+1}^{n}$, AdaBoost is a boosting algorithm whose goal is
to learn a strong classifier $H(x)$ based on an ensemble of weak classifiers
$h_t(x)$ combined in a weighted voting scheme.

$$ H(x) = \operatorname{sign}\left(f(x)\right) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right) \tag{1} $$

To achieve this, a weight distribution $D_t(i)$ is defined over the whole
training set. In each learning round $t$ the weak learner selects the best
classifier according to the weight distribution, and this weak classifier is
added to the ensemble weighted by a goodness parameter $\alpha_t$ (2) that
depends on a correlation term $r_t$ (3). Each time a weak classifier is
selected, the weight distribution is updated according to its performance,
following rule (4), where $Z_t$ (5) is a normalization factor which ensures
that $D_{t+1}(i)$ is an actual distribution. The process can be repeated
iteratively until a fixed number of rounds is reached, or until the obtained
strong classifier achieves some performance goal.



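$$ \alpha_t = \frac{1}{2} \ln\left(\frac{1 + r_t}{1 - r_t}\right) \tag{2} $$

$$ r_t = \sum_{i=1}^{n} D_t(i)\, y_i h_t(x_i) \tag{3} $$

$$ D_{t+1}(i) = \frac{D_t(i) \exp\left(-\alpha_t y_i h_t(x_i)\right)}{Z_t} \tag{4} $$

$$ Z_t = \sum_{i=1}^{n} D_t(i) \exp\left(-\alpha_t y_i h_t(x_i)\right) \tag{5} $$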
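As an illustration of equations (2) to (5), the following minimal sketch performs a single boosting round on a toy set. The function name `boosting_round` and the four-example data are assumptions made only for this example, and the sketch assumes $|r_t| < 1$ so that the logarithm is finite.

```python
import math


def boosting_round(weights, labels, predictions):
    """One AdaBoost round: returns (alpha_t, r_t, Z_t, D_{t+1})."""
    # Eq. (3): weighted correlation between labels and weak predictions.
    r = sum(d * y * h for d, y, h in zip(weights, labels, predictions))
    # Eq. (2): goodness parameter (assumes |r| < 1).
    alpha = 0.5 * math.log((1.0 + r) / (1.0 - r))
    # Numerator of Eq. (4) for every example.
    unnormalized = [
        d * math.exp(-alpha * y * h)
        for d, y, h in zip(weights, labels, predictions)
    ]
    # Eq. (5): normalization factor.
    z = sum(unnormalized)
    # Eq. (4): updated weight distribution.
    new_weights = [u / z for u in unnormalized]
    return alpha, r, z, new_weights


# Hypothetical toy data: the weak classifier is wrong only on the second example.
labels = [1, 1, -1, -1]
predictions = [1, -1, -1, -1]
weights = [0.25, 0.25, 0.25, 0.25]

alpha, r, z, new_weights = boosting_round(weights, labels, predictions)
```

In this toy case the single misclassified example ends the round holding half of the total weight, which is the usual reweighting behavior of the update rule.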


This framework can be seen (Schapire and Singer, 1999) as an additive
(round-by-round) minimization process of an exponential bound on the
training error of the strong classifier. The bounding process is based on
(6), and all the above expressions (including the weight update rule) can be
derived from it.


$$ H(x_i) \neq y_i \;\Rightarrow\; y_i f(x_i) \le 0 \;\Rightarrow\; \exp\left(-y_i f(x_i)\right) \ge 1 \tag{6} $$

Following the procedure used by Schapire and Singer (1999), the final
bound of the training error obtained by AdaBoost is expressed as (7). The
additive minimization of $\tilde{E}_T$ can be seen as finding, round by
round, the weak hypothesis $h_t$ that maximizes $r_t$, that is, maximizing
the correlation between labels ($y_i$) and predictions ($h_t$) weighted by
$D_t(i)$.

$$ E_T \le \prod_{t=1}^{T} Z_t \le \prod_{t=1}^{T} \sqrt{1 - r_t^2} = \tilde{E}_T \tag{7} $$

For the sake of simplicity and clarity in our analysis, we will focus on the
discrete version of the algorithm. In that case weak hypotheses are binary,
$h_t \in \{-1, +1\}$, and the minimization process is equivalent to
selecting the weak classifier with the lowest weighted error $\epsilon_t$
(8).¹ In this case, the last inequality in (7) becomes an equality, and
parameter $\alpha_t$ can be rewritten (9) in terms of $\epsilon_t$.

$$ \epsilon_t = \sum_{i=1}^{n} D_t(i)\, [\![ h_t(x_i) \neq y_i ]\!] = \sum_{\text{nok}} D_t(i) \tag{8} $$

$$ \alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) \tag{9} $$

This simplification does not prevent our analysis from being extended to
other AdaBoost variations.
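The relationships that hold in the discrete case can be checked numerically. The sketch below, with a hypothetical helper `discrete_quantities` and made-up weights, computes the weighted error, the correlation and the normalization factor for one binary weak hypothesis, so that the identities $r_t = 1 - 2\epsilon_t$ and $Z_t = 2\sqrt{\epsilon_t (1 - \epsilon_t)} = \sqrt{1 - r_t^2}$ can be observed. It assumes $0 < \epsilon_t < 1$.

```python
import math


def discrete_quantities(weights, labels, predictions):
    """Return (epsilon_t, r_t, alpha_t, Z_t) for a binary weak hypothesis."""
    # Weighted error: total weight of the misclassified ("nok") examples.
    eps = sum(d for d, y, h in zip(weights, labels, predictions) if h != y)
    # Weighted correlation between labels and predictions.
    r = sum(d * y * h for d, y, h in zip(weights, labels, predictions))
    # Goodness parameter written in terms of the weighted error.
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Normalization factor of the weight update.
    z = sum(
        d * math.exp(-alpha * y * h)
        for d, y, h in zip(weights, labels, predictions)
    )
    return eps, r, alpha, z


# Hypothetical non-uniform distribution; examples 2 and 4 are misclassified.
weights = [0.4, 0.3, 0.2, 0.1]
labels = [1, 1, -1, -1]
predictions = [1, -1, -1, 1]

eps, r, alpha, z = discrete_quantities(weights, labels, predictions)
```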

2.3. AdaBoost and Asymmetry 

AdaBoost is usually seen as a learning procedure driven by
misclassification on the training set. In that sense, the exponential bound
to minimize must be defined as in (10), following the guidelines proposed by
Schapire and Singer (1999). This bounding process is visualized in Figure 1.

$$ E_T = \frac{1}{n} \sum_{i=1}^{n} [\![ H(x_i) \neq y_i ]\!] \le \frac{1}{n} \sum_{i=1}^{n} \exp\left(-y_i f(x_i)\right) = \prod_{t=1}^{T} Z_t = \tilde{E}_T \tag{10} $$

From this point of view, AdaBoost is an algorithm with a symmetric
behavior if the number of instances in the training set is the same for the
two classes, or biased towards the prevalent class otherwise. Consequently,
AdaBoost could not be a cost-sensitive algorithm unless the training data
set is resampled accordingly.

Figure 1: AdaBoost exponential training error bound. Horizontal axis
($y_i f(x_i)$) represents the absolute value of the final score of the
strong classifier, with negative sign for errors and positive for correct
classifications. Vertical axis is the loss related to misclassification and
its exponential bound.

¹ Notation: Operator $[\![ a ]\!]$ is 1 if $a$ is true and 0 otherwise. The
term 'ok' refers to those training examples for which the result of the weak
classifier is right, $\{i : h(x_i) = y_i\}$, and 'nok' to those for which it
is wrong, $\{i : h(x_i) \neq y_i\}$.

Based on this seemingly balanced nature, several modifications of AdaBoost
have been proposed in order to adapt the algorithm to cost-sensitive
problems. Most of them (Karakoulas and Shawe-Taylor, 1998; Fan et al.,
1999; Ting, 2000; Sun et al., 2007; Viola and Jones, 2002) are based on
modifying the weight update rule in an asymmetric (class-conditional) way.
However, it is not clear how these changes affect the theoretical properties
of AdaBoost since, as was mentioned above, the update rule is a consequence
of the minimization process and not an arbitrary starting point of it.

This perspective is supported by the fact that AdaBoost is usually
explained with a fixed uniform initial weight distribution
($D_1(i) = 1/n$) (e.g., Schapire and Singer, 1999; Friedman et al., 2000;
Schapire et al., 1998a; Fan et al., 1999; Ting, 2000; Sun et al., 2007;
Masnadi-Shirazi and Vasconcelos, 2007; Freund and Schapire, 1999; Polikar,
2006, 2007). Nevertheless, some initial works by Freund and Schapire (1997)
leave this distribution free to be controlled by the learner. In our
explanation of the algorithm in Section 2.1 we deliberately did not mention
anything about the initialization of the weight distribution. So, what would
really happen if the initial distribution was a generic one? Changes of the
initial distribution were used to deal with cost-related utility functions
(Schapire et al., 1998b), and cost-sensitive weight initializations tied to
different changes in the weight update rule were also used by Karakoulas and
Shawe-Taylor (1998), Fan et al. (1999) and Ting (2000). Viola and Jones
(2002) proposed a first modification of AdaBoost equivalent to an asymmetric
modification of the initial weights. Nevertheless, they discarded this
approach, arguing that the induced asymmetry is fully absorbed by the first
round, leaving the rest of the process entirely symmetric. Their final
proposal (coined AsymBoost) spreads the desired asymmetry evenly over a
predefined number of rounds.

Though it is not widely appreciated, it can be easily shown that the error
bounded and minimized by AdaBoost is actually a weighted error that depends
on the initial weight distribution. The only change with regard to the usual
bound (10), in which uniform initial weights have been taken out of the
summation, is that generic initial weights must be kept inside the summation
during the bounding process (11).

$$ E_T = \sum_{i=1}^{n} D_1(i)\, [\![ H(x_i) \neq y_i ]\!] \le \sum_{i=1}^{n} D_1(i) \exp\left(-y_i f(x_i)\right) = \prod_{t=1}^{T} Z_t = \tilde{E}_T \tag{11} $$

The rest of the process remains identical to that explained by Schapire
and Singer (1999), consequently guaranteeing all the theoretical properties
of AdaBoost with regard to training and generalization errors.

3. Revisiting AdaBoost

In this section we will show our novel class-conditional interpretation
model for AdaBoost. This generalized analysis will shed light on the
class-dependent behavior of AdaBoost sketched in the previous section.

3.1. Asymmetric Interpretation

To derive our new interpretation of AdaBoost, instead of the initial weight
distribution used in the original AdaBoost formulation, we define a set of
parameters which contain exactly the same information as the former
distribution.

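• Asymmetry:

$$ \sum_{i=1}^{m} D_1(i) = \gamma \in (0, 1) \tag{12} $$

• Class-conditional distributions:

$$ D_{1+}(i) = \frac{D_1(i)}{\gamma}, \quad \text{for } i = 1, \ldots, m \tag{13} $$

$$ D_{1-}(i) = \frac{D_1(i)}{1 - \gamma}, \quad \text{for } i = m+1, \ldots, n \tag{14} $$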

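If we put this new set of parameters into the training error expression
(11) we will be able to decompose it in terms of its positive and negative
class error components:

$$
\begin{aligned}
E_T &= \sum_{i=1}^{n} D_1(i)\, [\![ H(x_i) \neq y_i ]\!] \\
    &= \gamma \sum_{i=1}^{m} D_{1+}(i)\, [\![ H(x_i) \neq y_i ]\!]
     + (1 - \gamma) \sum_{i=m+1}^{n} D_{1-}(i)\, [\![ H(x_i) \neq y_i ]\!] \\
    &= \gamma\, E_{T+} + (1 - \gamma)\, E_{T-}
\end{aligned}
\tag{15}
$$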
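A small numerical sketch may help to see that the decomposition loses no information. The helper names (`decompose_weights`, `weighted_error`) and the five-example data below are assumptions made for illustration only: a generic initial distribution is split into $\gamma$, $D_{1+}$ and $D_{1-}$, and the weighted error of an arbitrary set of decisions is compared with its class-conditional recombination.

```python
def decompose_weights(initial_weights, m):
    """Split D_1 into (gamma, D_1+, D_1-); the first m examples are positives."""
    gamma = sum(initial_weights[:m])                           # Eq. (12)
    d_pos = [w / gamma for w in initial_weights[:m]]           # Eq. (13)
    d_neg = [w / (1.0 - gamma) for w in initial_weights[m:]]   # Eq. (14)
    return gamma, d_pos, d_neg


def weighted_error(weights, labels, decisions):
    """Total weight of the examples whose decision differs from the label."""
    return sum(w for w, y, h in zip(weights, labels, decisions) if h != y)


# Hypothetical data: two positives followed by three negatives.
m = 2
labels = [1, 1, -1, -1, -1]
decisions = [1, -1, -1, 1, -1]
initial_weights = [0.4, 0.2, 0.1, 0.2, 0.1]

gamma, d_pos, d_neg = decompose_weights(initial_weights, m)
e_total = weighted_error(initial_weights, labels, decisions)
e_pos = weighted_error(d_pos, labels[:m], decisions[:m])
e_neg = weighted_error(d_neg, labels[m:], decisions[m:])
recombined = gamma * e_pos + (1.0 - gamma) * e_neg
```

Both class-conditional vectors sum to one, and `recombined` matches `e_total` up to floating-point rounding.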

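Bounding (15) with the usual exponential approximation, we can also
obtain the error bound as the combination of two class-conditional partial
error bounds:

$$
\begin{aligned}
E_T &= \gamma\, E_{T+} + (1 - \gamma)\, E_{T-} \\
    &\le \gamma \sum_{i=1}^{m} D_{1+}(i) \exp\left(-y_i f(x_i)\right)
     + (1 - \gamma) \sum_{i=m+1}^{n} D_{1-}(i) \exp\left(-y_i f(x_i)\right) \\
    &= \gamma\, \tilde{E}_{T+} + (1 - \gamma)\, \tilde{E}_{T-} = \tilde{E}_T
\end{aligned}
\tag{16}
$$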




Figure 2: AdaBoost training error and its exponential bound split into two
class-conditional components for an asymmetry of $\gamma = 2/3$.


In Figure 2 we can see the defined weighted partial error bounds
($\tilde{E}_{T+}$ and $\tilde{E}_{T-}$) for an asymmetry of $\gamma = 2/3$
(assuming uniform class-conditional distributions). Asymmetry becomes
evident.

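As can be seen, the two partial bounds have expressions formally identical
to that of the general bound used in the original AdaBoost (11), so an
equivalent update rule can be derived for each class error:

$$ D_{(t+1)+}(i) = \frac{D_{t+}(i) \exp\left(-\alpha_t y_i h_t(x_i)\right)}{Z_{t+}} \tag{17} $$

$$ D_{(t+1)-}(i) = \frac{D_{t-}(i) \exp\left(-\alpha_t y_i h_t(x_i)\right)}{Z_{t-}} \tag{18} $$

where

$$ Z_{t+} = \sum_{i=1}^{m} D_{t+}(i) \exp\left(-\alpha_t y_i h_t(x_i)\right) \tag{19} $$

$$ Z_{t-} = \sum_{i=m+1}^{n} D_{t-}(i) \exp\left(-\alpha_t y_i h_t(x_i)\right) \tag{20} $$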
We will also define two new parameters $P_{t+}$ and $P_{t-}$ which,
unraveling the update rules, can be expressed as follows:

$$ P_{t+} = \prod_{k=1}^{t-1} Z_{k+} = \sum_{i=1}^{m} \left( D_{1+}(i) \prod_{k=1}^{t-1} \exp\left(-\alpha_k y_i h_k(x_i)\right) \right) \tag{21} $$

$$ P_{t-} = \prod_{k=1}^{t-1} Z_{k-} = \sum_{i=m+1}^{n} \left( D_{1-}(i) \prod_{k=1}^{t-1} \exp\left(-\alpha_k y_i h_k(x_i)\right) \right) \tag{22} $$


These parameters (whose meaning we will discuss later) allow us to
express the partial error bounds in a compact form:

$$ \tilde{E}_{t+} = P_{t+}\, Z_{t+} \tag{23} $$

$$ \tilde{E}_{t-} = P_{t-}\, Z_{t-} \tag{24} $$

The global error bound of the original view of AdaBoost, $\tilde{E}_t$, can
also be analogously rewritten by defining an equivalent parameter $P_t$ for
the whole training set:

$$ P_t = \prod_{k=1}^{t-1} Z_k = \sum_{i=1}^{n} \left( D_1(i) \prod_{k=1}^{t-1} \exp\left(-\alpha_k y_i h_k(x_i)\right) \right) \tag{25} $$

$$ \tilde{E}_t = P_t\, Z_t \tag{26} $$

As a result, the error bound to minimize can be expressed as

$$
\begin{aligned}
\tilde{E}_t &= \gamma\, \tilde{E}_{t+} + (1 - \gamma)\, \tilde{E}_{t-} \\
            &= \gamma\, P_{t+} Z_{t+} + (1 - \gamma)\, P_{t-} Z_{t-}
\end{aligned}
\tag{27}
$$

Bearing in mind that in each round the only variable parameters are
$Z_{t+}$ and $Z_{t-}$ ($\gamma$ is fixed from the beginning, and $P_{t+}$ and
$P_{t-}$ depend only on the previous rounds), we can minimize $\tilde{E}_t$
using a procedure analogous to that proposed by Schapire and Singer (1999).
While the minimization is exactly the same as in the original case
($\partial \tilde{E}_t / \partial \alpha_t = 0$), the process can be entirely
performed in terms of the class-conditional parameters, and it allows us to
obtain the following expression of the error to be minimized round by round:

$$ \epsilon_t = \frac{\gamma P_{t+}}{\gamma P_{t+} + (1 - \gamma) P_{t-}}\, \epsilon_{t+} + \frac{(1 - \gamma) P_{t-}}{\gamma P_{t+} + (1 - \gamma) P_{t-}}\, \epsilon_{t-} \tag{28} $$

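where $\epsilon_{t+}$ and $\epsilon_{t-}$ are the partial weighted errors
per class:

$$ \epsilon_{t+} = \sum_{i=1}^{m} D_{t+}(i)\, [\![ h_t(x_i) \neq y_i ]\!] = \sum_{\text{pos nok}} D_{t+}(i) \tag{29} $$

$$ \epsilon_{t-} = \sum_{i=m+1}^{n} D_{t-}(i)\, [\![ h_t(x_i) \neq y_i ]\!] = \sum_{\text{neg nok}} D_{t-}(i) \tag{30} $$

The expression for $\alpha_t$ is: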


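$$
\begin{aligned}
\alpha_t &= \frac{1}{2} \ln \left( \frac{\gamma P_{t+} \sum_{\text{pos ok}} D_{t+}(i) + (1 - \gamma) P_{t-} \sum_{\text{neg ok}} D_{t-}(i)}{\gamma P_{t+} \sum_{\text{pos nok}} D_{t+}(i) + (1 - \gamma) P_{t-} \sum_{\text{neg nok}} D_{t-}(i)} \right) \\
         &= \frac{1}{2} \ln \left( \frac{1 - \epsilon_t}{\epsilon_t} \right)
\end{aligned}
\tag{31}
$$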
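As a worked step, and only as our reading of the minimization for the discrete case ($h_t \in \{-1, +1\}$), the expression for $\alpha_t$ can be recovered by splitting each class-conditional normalization factor into correctly and incorrectly classified mass. The shorthand $A_t$ and $B_t$ is introduced here purely for illustration and is not part of the original notation.

```latex
\begin{align*}
Z_{t\pm} &= (1 - \epsilon_{t\pm})\, e^{-\alpha_t} + \epsilon_{t\pm}\, e^{\alpha_t} \\
\tilde{E}_t &= \gamma P_{t+} Z_{t+} + (1 - \gamma) P_{t-} Z_{t-}
             = A_t\, e^{-\alpha_t} + B_t\, e^{\alpha_t} \\
A_t &= \gamma P_{t+} (1 - \epsilon_{t+}) + (1 - \gamma) P_{t-} (1 - \epsilon_{t-}) \\
B_t &= \gamma P_{t+}\, \epsilon_{t+} + (1 - \gamma) P_{t-}\, \epsilon_{t-} \\
\frac{\partial \tilde{E}_t}{\partial \alpha_t} &= -A_t\, e^{-\alpha_t} + B_t\, e^{\alpha_t} = 0
  \;\Rightarrow\; \alpha_t = \frac{1}{2} \ln \frac{A_t}{B_t}
\end{align*}
```

Since $1 - \epsilon_{t+} = \sum_{\text{pos ok}} D_{t+}(i)$ (and likewise for the negative class), $A_t / B_t$ is the ratio inside the logarithm above; dividing numerator and denominator by $\gamma P_{t+} + (1 - \gamma) P_{t-}$ and using (28) gives $(1 - \epsilon_t)/\epsilon_t$.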




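And the final training error bound can be expressed as:

$$ E_T \le \tilde{E}_T = \prod_{t=1}^{T} Z_t = \prod_{t=1}^{T} \sqrt{1 - r_t^2} \tag{32} $$

where

$$
r_t = \frac{\gamma P_{t+}}{\gamma P_{t+} + (1 - \gamma) P_{t-}} \sum_{i=1}^{m} D_{t+}(i)\, y_i h_t(x_i)
    + \frac{(1 - \gamma) P_{t-}}{\gamma P_{t+} + (1 - \gamma) P_{t-}} \sum_{i=m+1}^{n} D_{t-}(i)\, y_i h_t(x_i)
\tag{33}
$$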

As we can see, all the magnitudes ($\epsilon_t$, $\alpha_t$ and $r_t$) are
systematically decoupled into two components according to the global
asymmetry and the classifier behavior over each class. The key concept is
that expressions (28), (31) and (33) are actually the same as those of the
original AdaBoost formulation (8), (9) and (3), respectively. On one hand,
the derivation is equivalent to the original with the only exception that
weights are decomposed into three parameters (12), (13), (14). On the other
hand, during the derivation process we can obtain equivalences (34), (35),
(36) which, appropriately substituted into the original AdaBoost
expressions, lead us to the new ones.

$$ \gamma P_{t+} + (1 - \gamma) P_{t-} = P_t \tag{34} $$

$$ \gamma P_{t+} D_{t+}(i) = P_t D_t(i), \quad \text{for } i = 1, \ldots, m \tag{35} $$

$$ (1 - \gamma) P_{t-} D_{t-}(i) = P_t D_t(i), \quad \text{for } i = m+1, \ldots, n \tag{36} $$
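These equivalences can be checked numerically by running the global update and the two class-conditional updates side by side. The following sketch does so for a fixed, hypothetical sequence of binary weak hypotheses; the function name `max_equivalence_gap` and all data are assumptions for illustration, and each round is assumed to have a weighted error strictly between 0 and 1.

```python
import math


def max_equivalence_gap(gamma, d_pos, d_neg, labels, rounds):
    """Largest absolute violation of the three equivalences over all rounds.

    `rounds` is a list of prediction vectors, one weak hypothesis per round.
    Positives are assumed to come first in `labels`.
    """
    m = len(d_pos)
    d = [gamma * w for w in d_pos] + [(1.0 - gamma) * w for w in d_neg]
    p = p_pos = p_neg = 1.0  # empty products before the first round
    gaps = []
    for h in rounds:
        # Check the equivalences with the quantities available at round t.
        gaps.append(abs(gamma * p_pos + (1.0 - gamma) * p_neg - p))
        for i in range(m):
            gaps.append(abs(gamma * p_pos * d_pos[i] - p * d[i]))
        for j in range(len(d_neg)):
            gaps.append(abs((1.0 - gamma) * p_neg * d_neg[j] - p * d[m + j]))
        # The goodness parameter is shared by the global and class views.
        eps = sum(w for w, y, hi in zip(d, labels, h) if hi != y)
        alpha = 0.5 * math.log((1.0 - eps) / eps)
        factors = [math.exp(-alpha * y * hi) for y, hi in zip(labels, h)]
        # Normalization factors of the global and class-conditional updates.
        z = sum(w * f for w, f in zip(d, factors))
        z_pos = sum(w * f for w, f in zip(d_pos, factors[:m]))
        z_neg = sum(w * f for w, f in zip(d_neg, factors[m:]))
        # Weight updates.
        d = [w * f / z for w, f in zip(d, factors)]
        d_pos = [w * f / z_pos for w, f in zip(d_pos, factors[:m])]
        d_neg = [w * f / z_neg for w, f in zip(d_neg, factors[m:])]
        # Cumulative products of the normalization factors.
        p, p_pos, p_neg = p * z, p_pos * z_pos, p_neg * z_neg
    return max(gaps)


# Hypothetical setup: two positives, three negatives, three weak hypotheses.
labels = [1, 1, -1, -1, -1]
rounds = [
    [1, -1, -1, -1, 1],
    [1, 1, 1, -1, -1],
    [-1, 1, -1, -1, -1],
]
gap = max_equivalence_gap(2.0 / 3.0, [0.5, 0.5], [1.0 / 3.0] * 3, labels, rounds)
```

Under these assumptions `gap` stays at the level of floating-point rounding, which is what the equivalences predict.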


3.2. Asymmetric Error Analysis 

The initial weight decomposition in our analysis allows us to decouple the
global weight distribution information into two levels which were always
mixed in the original AdaBoost formulation:

• Class level: The asymmetry parameter $\gamma$ models the global cost of
the positive class over the negative one. From a practical point of view,
this parameter can be used to introduce asymmetry in the strong classifier.


• Example level: The class-conditional initial weight distributions
($D_{1+}(i)$ and $D_{1-}(i)$) model the relative relevance of each example
inside its own class. So, being two separate distributions, they are
isolated from the asymmetry of the problem.

This two-level categorization can be extrapolated to the error bound
minimized by AdaBoost in each round, yielding a new insight.
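$$
\tilde{E}_t =
\underbrace{
  \underbrace{
    \underbrace{\gamma}_{\text{Global Asymmetry}}\;
    \underbrace{P_{t+}}_{\text{Previous Rounds Asymmetry}}
  }_{\text{Effective Asymmetry (CLASS LEVEL)}}
  \;
  \underbrace{\sum_{i=1}^{m} D_{t+}(i) \exp\left(-\alpha_t y_i h_t(x_i)\right)}_{\text{Current Round (EXAMPLE LEVEL)}}
}_{\text{POSITIVES BEHAVIOR}}
+
\underbrace{
  \underbrace{
    \underbrace{(1 - \gamma)}_{\text{Global Asymmetry}}\;
    \underbrace{P_{t-}}_{\text{Previous Rounds Asymmetry}}
  }_{\text{Effective Asymmetry (CLASS LEVEL)}}
  \;
  \underbrace{\sum_{i=m+1}^{n} D_{t-}(i) \exp\left(-\alpha_t y_i h_t(x_i)\right)}_{\text{Current Round (EXAMPLE LEVEL)}}
}_{\text{NEGATIVES BEHAVIOR}}
\tag{37}
$$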


The bound consists of two formally identical terms, one per class (positive
and negative). Each term has two main components: one on the class level
and another one on the example level.


• The class level defines the effective asymmetry demanded for the current
round. It can be seen as the global desired asymmetry modulated by the
past asymmetric behavior of the classifier (encoded by cumulative errors
$P_{t+}$ and $P_{t-}$). It only depends on the previous rounds.


• The example level is related to the weighted error of the current weak
classifier. Weight distributions ($D_{t+}(i)$ and $D_{t-}(i)$) are updated,
round by round, to encode the effective relative relevance of each example
totally apart from the class behavior. It depends both on the previous
and current rounds.

As we can see, the effective asymmetry of each round depends on the
asymmetry of the previous ones, so the goal of AdaBoost is to iteratively
find the weak hypothesis which, given its predecessors, best contributes to
the global asymmetry by minimizing the training error bound. Asymmetry is
reached in a round-by-round adaptive way, without any previous restriction
on the final number of rounds.
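To make the class-level term concrete, the sketch below computes the normalized weight that a round places on the positive class, $\gamma P_{t+} / (\gamma P_{t+} + (1 - \gamma) P_{t-})$, which is the coefficient multiplying the positive partial error in the round-by-round error expression. The function name `effective_asymmetry` and the numeric values are assumptions chosen for illustration.

```python
def effective_asymmetry(gamma, p_pos, p_neg):
    """Normalized class-level weight of the positive class for one round."""
    positive_term = gamma * p_pos
    negative_term = (1.0 - gamma) * p_neg
    return positive_term / (positive_term + negative_term)


# In the first round both cumulative products are empty (equal to 1),
# so the effective asymmetry equals the global asymmetry gamma.
first_round = effective_asymmetry(2.0 / 3.0, 1.0, 1.0)

# Hypothetical later round: the positive partial bound has already been
# halved by previous rounds while the negative one has not moved.
later_round = effective_asymmetry(2.0 / 3.0, 0.5, 1.0)
```

In the hypothetical later round the demand shifts towards the negative class, which illustrates how the global asymmetry is modulated by what previous rounds have already achieved.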
This error bound interpretation can open the door to new modifications
of AdaBoost based, for example, on tuning the global and past asymmetry
contributions in order to achieve different asymmetric behaviors along
rounds.

3.3. Algorithm 

Once we have seen the actual asymmetric properties of AdaBoost when
using a generic initial distribution, the complete algorithm can be
reformatted as in Table 1.

The only change with regard to the algorithm description usually found in
the literature is that the initial weight distribution is not necessarily
uniform. Here, we initialize it in terms of an asymmetry parameter
($\gamma$) and two class-conditional distributions ($D_{1+}(i)$ and
$D_{1-}(i)$), which can be uniform (all the examples of each class weigh the
same) or not (some examples are emphasized).
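The procedure can be sketched in a few lines of code. What follows is a minimal illustration under stated assumptions, not the implementation used in our experiments: the weak learner is an exhaustive search over axis-aligned decision stumps, the names (`best_stump`, `asymmetric_adaboost`, `strong_classify`) are hypothetical, and the weighted error is clamped away from 0 and 1 only to keep the logarithm finite.

```python
import math


def stump_predict(stump, x):
    """Axis-aligned stump: `polarity` above the threshold, `-polarity` below."""
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted error, stump) for the stump with the lowest error."""
    best = None
    for feature in range(len(xs[0])):
        values = sorted({x[feature] for x in xs})
        thresholds = [values[0] - 1.0]
        thresholds += [(a + b) / 2.0 for a, b in zip(values, values[1:])]
        for threshold in thresholds:
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(
                    w for x, y, w in zip(xs, ys, weights)
                    if stump_predict(stump, x) != y
                )
                if best is None or err < best[0]:
                    best = (err, stump)
    return best


def asymmetric_adaboost(pos, neg, gamma, rounds, d_pos=None, d_neg=None):
    """Discrete AdaBoost whose only change is the initial weight distribution."""
    xs = list(pos) + list(neg)
    ys = [1] * len(pos) + [-1] * len(neg)
    # Class-conditional distributions default to uniform within each class.
    d_pos = d_pos or [1.0 / len(pos)] * len(pos)
    d_neg = d_neg or [1.0 / len(neg)] * len(neg)
    # Global initial weights: gamma * D_1+ for positives, (1 - gamma) * D_1- for negatives.
    weights = [gamma * w for w in d_pos] + [(1.0 - gamma) * w for w in d_neg]
    ensemble = []
    for _ in range(rounds):
        err, stump = best_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1.0 - 1e-12)  # keep the logarithm finite
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, stump))
        # Standard, class-independent weight update followed by normalization.
        weights = [
            w * math.exp(-alpha * y * stump_predict(stump, x))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def strong_classify(ensemble, x):
    """Sign of the weighted vote of the selected weak classifiers."""
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical one-dimensional overlapping data (not separable by one stump).
positives = [(1.0,), (2.0,), (4.0,)]
negatives = [(3.0,), (5.0,), (6.0,)]
favor_positives = asymmetric_adaboost(positives, negatives, gamma=0.9, rounds=1)
favor_negatives = asymmetric_adaboost(positives, negatives, gamma=0.1, rounds=1)
```

With this toy data, a single round with $\gamma = 0.9$ selects a stump that classifies every positive correctly at the cost of one false positive, while $\gamma = 0.1$ selects a stump with no false positives and one false negative.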

3.4. Experiments

In order to illustrate our analysis with empirical results on the
asymmetric behavior of AdaBoost with unbalanced initial weight
distributions, we performed three kinds of experiments. For these
experiments we have defined the Asymmetric Error (AsErr) as the
cost-sensitive error of the classifier: the weighted average of the
positive (PosErr) and negative (NegErr) error rates or, equivalently, the
weighted average of the false negative (FN) and false positive (FP) rates.

$$
\begin{aligned}
\text{AsErr} &= \gamma \cdot \text{PosErr} + (1 - \gamma) \cdot \text{NegErr} \\
             &= \gamma \cdot \text{FN} + (1 - \gamma) \cdot \text{FP}
\end{aligned}
$$
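As a quick sanity check of this definition, the small sketch below (the function name `asymmetric_error` is an assumption) recomputes AsErr from the FN and FP rates reported for the synthetic dataset in Table 2, with rates expressed in percent.

```python
def asymmetric_error(gamma, false_negative_rate, false_positive_rate):
    """Cost-sensitive error: gamma-weighted average of the FN and FP rates."""
    return gamma * false_negative_rate + (1.0 - gamma) * false_positive_rate


# FN and FP rates (in percent) taken from the synthetic cloud results.
as_err_3_5 = asymmetric_error(3.0 / 5.0, 26.80, 38.00)
as_err_7_8 = asymmetric_error(7.0 / 8.0, 7.60, 66.40)
```

The two values agree with the AsErr column of Table 2 (31.28% and 14.95%).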

First, we used the separable set of Figure 3 (inspired by that used by
Viola and Jones, 2002), in which positives are concentrated in a circular
area and negatives surround them, following the same uniform distribution in
both cases. Weak classifiers are stumps in the linear two-dimensional space.

AdaBoost behavior for this training set and different asymmetries
($\gamma = 1/2$, $3/5$, $2/3$ and $7/8$) is shown in Figure 4. We can see
that, as the asymmetry grows, the positive error bound and the respective
positive training/test errors tend to be lower, while the negative error
bound and the respective negative training/test errors tend to be higher.
This behavior does not prevent the classifier from asymptotically improving
round by round, approaching a zero-training-error classifier, due to the
separable nature of the classification problem. The key advantage of this
approach is that the error evolution follows an unbalanced behavior,
allowing the user to stop training at any iteration, with the theoretical
confidence of having minimized the error bound with the desired asymmetry no
matter which iteration we are in (as opposed to the AsymBoost philosophy;
Viola and Jones, 2002). This can be very useful for the flexible building of
cascaded classifiers such as the ones proposed by Viola and Jones (2004).

Table 1: Discrete AdaBoost generalized formulation for asymmetric
classification problems.

Given:

• A set of positive examples: $(x_i, y_i) = (x_1, 1), \ldots, (x_m, 1)$

• A set of negative examples: $(x_i, y_i) = (x_{m+1}, -1), \ldots, (x_n, -1)$

• An asymmetry parameter: $\gamma \in (0, 1)$.

• Two weight distributions over the positive ($D_{1+}(i)$) and negative
examples ($D_{1-}(i)$).

Initialize the global weight distribution as:

• $D_1(i) = \gamma\, D_{1+}(i)$ for $i = 1, \ldots, m$

• $D_1(i) = (1 - \gamma)\, D_{1-}(i)$ for $i = m+1, \ldots, n$

For $t = 1, \ldots, T$ (or until the strong classifier reaches some
performance goal):

• Select the weak classifier $h_t(x)$ with the lowest weighted error

$$ \epsilon_t = \sum_{i=1}^{n} D_t(i)\, [\![ h_t(x_i) \neq y_i ]\!] = \sum_{\text{nok}} D_t(i) $$

• Calculate

$$ \alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) $$

• Update the weight distribution

$$ D_{t+1}(i) = \frac{D_t(i) \exp\left(-\alpha_t y_i h_t(x_i)\right)}{\sum_{i=1}^{n} D_t(i) \exp\left(-\alpha_t y_i h_t(x_i)\right)} $$

The final strong classifier is:

$$ H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right) $$

Figure 3: Training set (a) and example weak classifiers over the test set
(b) used to illustrate our asymmetric analysis of AdaBoost. Positive
examples are marked as '+', while 'o' are the negative ones.

We also ran this experiment with a non-separable set, shown in Figure 5,
for the same asymmetries ($\gamma = 1/2$, $3/5$, $2/3$ and $7/8$) (Figures 6
and 7). We can see that, due to the overlapping between classes (they are
non-separable), error curves tend to a working point different from that of
the previous experiment. In any case, the obtained behaviors are clearly
asymmetric along the whole evolution of the boosted classifiers, and the
degree of asymmetry is effectively managed by the $\gamma$ parameter.

Figure 4: Evolution of training error bounds (left column), training errors
(center column) and test errors (right column) through 100 rounds of
AdaBoost training and different asymmetries, using the set without
overlapping in Figure 3. Panels: (a) $\gamma = 1/2$, (b) $\gamma = 3/5$,
(c) $\gamma = 2/3$, (d) $\gamma = 7/8$. Horizontal axis: number of rounds.

Figure 5: Training set with overlapping (a) and example weak classifiers
over the test set (b). Positive examples are marked as '+', while 'o' are
the negative ones.

Finally, we have also conducted a more extensive experiment using both
synthetic and real datasets to obtain numerical results verifying our
hypothesis. The strategy we have followed is leave-one-out cross-validation.
Thus, iteratively selecting every example of a dataset, a classifier is
trained over the remaining elements and tested over the selected one. This
procedure is repeated for all the examples, all the datasets and all the
desired $\gamma$ parameters, so that overall performance figures can be
computed. Tables 2 and 3 summarize the obtained performance over the
synthetic dataset with overlapping in Figure 5 and some real asymmetric
datasets (Credit, Diabetes and Spam) extracted from the UCI Machine Learning
Repository (Frank and Asuncion, 2010). As can be seen, in all cases a
consistent asymmetric behavior is reached, and it is also progressive with
$\gamma$.
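The leave-one-out procedure described above can be sketched generically as follows. The function `leave_one_out_rates` accepts any training and classification callables; the midpoint learner used in the demonstration is a deliberately trivial stand-in (an assumption for this example, not the boosted classifier of the experiments), as are the six one-dimensional examples.

```python
def leave_one_out_rates(examples, labels, train, classify):
    """Return (FN rate, FP rate) estimated by leave-one-out cross-validation."""
    false_negatives = false_positives = 0
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    for i in range(len(examples)):
        # Train on every example except the i-th one.
        model = train(examples[:i] + examples[i + 1:], labels[:i] + labels[i + 1:])
        predicted = classify(model, examples[i])
        # Test on the held-out example and count the kind of mistake.
        if labels[i] == 1 and predicted != 1:
            false_negatives += 1
        elif labels[i] == -1 and predicted != -1:
            false_positives += 1
    return false_negatives / n_pos, false_positives / n_neg


def train_midpoint(xs, ys):
    """Stand-in learner: threshold halfway between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def classify_midpoint(threshold, x):
    """Stand-in classifier: positives lie below the threshold."""
    return 1 if x < threshold else -1


# Hypothetical overlapping one-dimensional data.
examples = [1.0, 2.0, 5.5, 4.0, 7.0, 8.0]
labels = [1, 1, 1, -1, -1, -1]
fn_rate, fp_rate = leave_one_out_rates(
    examples, labels, train_midpoint, classify_midpoint
)
```

The resulting FN and FP rates can then be combined into the asymmetric error for each value of $\gamma$.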


Table 2: Classifier behavior (false negatives, false positives,
classification error and asymmetric error) for different asymmetric
requirements over the synthetic cloud dataset with overlapping in Figure 5.

             Synthetic cloud
$\gamma$     FN        FP        ClErr     AsErr
1/2          31.60%    29.20%    30.40%    30.40%
3/5          26.80%    38.00%    32.40%    31.28%
2/3          22.00%    42.00%    32.00%    28.67%
7/8           7.60%    66.40%    37.00%    14.95%

Table 3: Classifier behavior (false negatives, false positives,
classification error and asymmetric error) for different asymmetric
requirements over real datasets extracted from the UCI Machine Learning
Repository (Frank and Asuncion, 2010).

             Credit
$\gamma$     FN        FP        ClErr     AsErr
1/2          28.67%    26.86%    27.40%    27.76%
3/5          22.67%    37.43%    33.00%    28.57%
2/3          18.67%    43.43%    36.00%    26.92%
7/8           6.00%    69.14%    50.20%    13.89%

             Diabetes
$\gamma$     FN        FP        ClErr     AsErr
1/2          32.09%    22.40%    25.78%    27.24%
3/5          22.39%    28.60%    26.43%    24.87%
2/3          19.78%    32.20%    27.86%    23.92%
7/8          10.07%    53.00%    38.02%    15.44%

             Spam
$\gamma$     FN        FP        ClErr     AsErr
1/2           4.84%     6.18%     5.37%     5.51%
3/5           4.16%     7.06%     5.30%     5.32%
2/3           3.84%     8.38%     5.63%     5.35%
7/8           2.33%    11.75%     6.04%     3.51%
Figure 6: Evolution of training error bounds (left column), training errors
(center column) and test errors (right column) through 100 rounds of
AdaBoost training and different asymmetries, using the set with overlapping
in Figure 5.

Figure 7: Classification results over the test set with overlapping
(Figure 5) for different asymmetries. As in Figure 3, true positives are
marked as '+', and 'o' are true negatives. However, in this case,
cyan-colored marks represent positive classifications while blue ones
represent negative classifications.
3.5. Discussion

Previous sections reveal that AdaBoost can be by itself an asymmetric
learning algorithm, following its original additive round-by-round updating
behavior. Our proposed change of perspective yields several consequences:

• The initial weight distribution is more than the distribution seen by
the first weak classifier. It is the distribution which weighs the global
error bound to be minimized by AdaBoost. Any asymmetry in this
initial weight distribution is an effective way to introduce asymmetry
in the strong classifier goal.

• This kind of asymmetry is asymptotic for the whole classifier and the
number of training rounds can be as flexible as in the original case
(unlike AsymBoost, Viola and Jones, 2002, which rigidly spreads the
asymmetry over a predefined number of rounds). Among other advantages,
this makes it possible, once a strong classifier is trained, to truncate
it at whatever round we consider, with the certainty that the error
bound has been minimized taking the desired global asymmetry into
account. Moreover, it can be especially useful for cascaded classifiers
such as those used for object detection (Viola and Jones, 2004), in which
each stage (each strong classifier) must be markedly asymmetric and as
short as possible, in order to improve rejection efficiency (and
consequently the real-time ability of the system).

• Asymmetry can be reached without changing the weight update rule, as
opposed to most of the asymmetric AdaBoost modifications in the
literature. It is argued that such a modification is needed because
AdaBoost updates weights of examples from different classes in the same
way, only distinguishing between correctly and incorrectly classified
ones. This is true, but it must be taken into account that, before the
first weight distribution update, AdaBoost must have selected a first
weak classifier $h_1(x)$ and a goodness parameter $\alpha_1$ according to
the initial weight distribution $D_1(i)$, which stores the desired
asymmetry information. Consequently $h_1(x)$ and $\alpha_1$ implicitly
encode asymmetry information, and both parameters are precisely the ones
that drive the update rule. The result is that asymmetry is indirectly
present in the usual weight update rule and, as seen in Section 3.2, all
the subsequent iterations can be seen as a round-by-round asymmetry
adaptive process. Any additional class-dependent change in the weight
update rule may emphasize, in a more or less controlled way, the described
asymmetric behavior, but in those cases it is not clear how it would affect
the theoretical properties of AdaBoost.

• All the formal guarantees provided by AdaBoost remain intact.


4. Conclusion 

In this paper we have introduced a new insight on the asymmetric learning
capabilities of AdaBoost, in which the symmetric case can be seen as a
particular case (when the asymmetry parameter is $\gamma = 0.5$). Beyond
some preconceptions, the only change needed with regard to the usual
formulation is how the weights are initialized. We have shown, using a novel
class-conditional interpretation of the error bound, that the asymmetric
behavior reached is asymptotic with the number of rounds and that it works,
like the whole algorithm, in an additive round-by-round way. The weight
update rule does not need to be changed and all the formal guarantees remain
intact. Our error bound interpretation can also be useful to develop new
AdaBoost modifications based on adjusting the different asymmetry components
(on the class and/or example levels). We have not presented a new
algorithm... it is just AdaBoost!